Welcome back everybody to pattern recognition. Today we want to explore the concept of kernels a bit further, and we will look into something called Mercer's theorem. So, looking forward to exploring kernel spaces.
So let's have a look at kernels. So far we've seen that linear decision boundaries in their current form have serious limitations: they are too simple to provide good decision boundaries, non-linearly separable data cannot be classified, noisy data causes problems, and the formulation only lets us work with vectorial data. One possible solution that we already hinted at is mapping into a higher-dimensional space using a non-linear feature transform and then using a linear classifier there. We've seen that the SVM decision boundary can be rewritten in dual form. In the dual we essentially got rid of the explicit normal vector: everything could be written as a sum over the Lagrange multipliers, the class labels, and the feature vectors, and in this optimization problem the feature vectors only appear inside inner products. The conclusion is that we only ever need inner products, and this holds in both the learning and the classification phase.
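As a quick reminder, here is a sketch of the standard soft-margin dual in my own notation (the slides may write it slightly differently; the alpha_i are the Lagrange multipliers, the y_i the class labels, C the usual box constraint):

    \max_{\alpha} \; \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
    \quad \text{subject to} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0

Note that the training vectors x_i enter this objective only through the inner products.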
Now let's look at this inner product in a bit more detail. We've seen that the same was true for the perceptron: there, we were essentially summing over all the update steps of the training procedure, and again only inner products were needed for the decision boundary. Moreover, we only need the observations that actually produced updates of the decision boundary during training; this is the set E, if you remember. So once more, everything boils down to inner products.
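As a sketch, assuming the usual perceptron update rule (each misclassified sample x_i adds y_i x_i to the normal vector; the bias term is omitted here), the learned classifier can be written as

    w = \sum_{i \in E} y_i \, x_i
    \qquad \Rightarrow \qquad
    f(x) = \operatorname{sign}\Big( \sum_{i \in E} y_i \, \langle x_i, x \rangle \Big)

so both training and classification only ever need inner products between feature vectors.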
We can now use feature transforms. Such a feature transform phi maps from the original d-dimensional space to a D-dimensional space, where capital D is greater than or equal to d, such that the resulting features become linearly separable. Let's look at one example. The original feature space is centered around zero: all observations from one class lie in the center, and all observations from the other class lie on a ring around them. Clearly, this example cannot be solved with a linear decision boundary. But if I take the feature transform phi(x) = (x1^2, x2^2), which has exactly the same dimensionality, then by mapping onto these squared dimensions we can find a linear decision boundary.
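Here is a minimal numerical sketch of this example in Python; the synthetic data and the use of scikit-learn's LinearSVC are my own illustration, not taken from the slides:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    # Class 0: points near the origin, class 1: points on a ring around them.
    radius = np.concatenate([rng.uniform(0.0, 1.0, 200), rng.uniform(2.0, 3.0, 200)])
    angle = rng.uniform(0.0, 2.0 * np.pi, 400)
    X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    y = np.concatenate([np.zeros(200), np.ones(200)])

    # A linear classifier in the original space performs poorly ...
    print(LinearSVC().fit(X, y).score(X, y))

    # ... but after phi(x) = (x1^2, x2^2) the classes are linearly separable.
    X_phi = X ** 2
    print(LinearSVC().fit(X_phi, y).score(X_phi, y))  # close to 1.0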
Now, this is a simple example, and things can get more difficult. It already gets difficult if the data is not centered: in that case, applying the same feature transform unfortunately does not make the data linearly separable. But if we use a 3-D transform that still includes, for example, x2, then we can again find a linear decision boundary. So this is the idea of using a polynomial feature transform to map into a higher-dimensional space that then allows us to use linear decision boundaries.
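As a sketch, one such 3-D transform (the concrete transform on the slide may differ) is

    \phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad \phi(x_1, x_2) = \big( x_1^2, \; x_2^2, \; x_2 \big)^\top

Keeping the linear term x2 accounts for the shift of the data along x2, so that a plane in the transformed 3-D space can separate the two classes again.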
We've also seen that a decision boundary given by a quadratic function is obviously not linear in x. But because it is linear in the parameters, i.e. in the entries of A, we can map into a high-dimensional feature space and obtain a linear decision boundary in this transformed space.
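To make this concrete, a generic quadratic decision boundary in 2-D can be written as (taking A symmetric, and writing b and c for the linear and constant parts)

    x^\top A x + b^\top x + c
    \;=\; a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2 + b_1 x_1 + b_2 x_2 + c \;=\; 0

This is non-linear in x, but linear in the parameters (a11, a12, a22, b1, b2, c), which means it is a linear decision boundary on the transformed features phi(x) = (x1^2, x1 x2, x2^2, x1, x2).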
Now let's consider distances in the transformed space. I apply phi to some x and to some x prime, take the difference, and compute the L2 norm. If we write this out, the squared norm is the inner product of the difference with itself, and expanding it we end up with nothing but inner products. So the distance between two transformed vectors can be expressed purely in terms of inner products; with such a feature transform we can even evaluate distances by means of inner products alone.
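Written out, the squared distance in the transformed space expands as

    \| \phi(x) - \phi(x') \|_2^2
    = \langle \phi(x) - \phi(x'), \, \phi(x) - \phi(x') \rangle
    = \langle \phi(x), \phi(x) \rangle - 2 \, \langle \phi(x), \phi(x') \rangle + \langle \phi(x'), \phi(x') \rangle

and every term on the right-hand side is an inner product of transformed feature vectors.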
This also means that the feature transform can be incorporated into the support vector machine very easily: the decision boundary is then expressed through inner products of the transformed feature vectors, and the optimization problem can be rewritten with these inner products as well.
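As a sketch, writing alpha_0 for the bias term and keeping the notation from above, the SVM decision function in the transformed space becomes

    f(x) = \operatorname{sign}\Big( \sum_{i=1}^{m} \alpha_i y_i \, \langle \phi(x_i), \phi(x) \rangle + \alpha_0 \Big)

and the dual objective stays the same as before, with the inner products <x_i, x_j> replaced by <phi(x_i), phi(x_j)>.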
This brings us to the notion of a kernel function. A kernel function k maps from two feature domains, which are of course identical, to a real value; it has to be a symmetric function that maps pairs of features to real numbers. The key property is that k is given as the inner product of the feature transforms, k(x, x') = <phi(x), phi(x')>.
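As a quick numerical sanity check (this specific kernel and feature map are my own example, not necessarily the one from the slides), the homogeneous polynomial kernel of degree 2 in 2-D, k(x, x') = <x, x'>^2, corresponds to the explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2):

    import numpy as np

    def phi(x):
        # Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D.
        x1, x2 = x
        return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

    def k(x, x_prime):
        # Kernel evaluation without ever computing phi explicitly.
        return np.dot(x, x_prime) ** 2

    x = np.array([1.0, 2.0])
    x_prime = np.array([3.0, -1.0])

    print(np.dot(phi(x), phi(x_prime)))  # 1.0
    print(k(x, x_prime))                 # 1.0 as well: k(x, x') = <phi(x), phi(x')>

Mercer's theorem, the topic mentioned at the start, characterizes which functions k admit such a feature map at all.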
In this video, we look at kernels for Support Vector Machines and the Perceptron and learn about Mercer's Theorem.
This video is released under CC BY 4.0. Please feel free to share and reuse.
Music Reference: Damiano Baldoni - Thinking of You